Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce default from 8 to 1 TVU receive socket/threads #998

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

steviez
Copy link

@steviez steviez commented Apr 23, 2024

Problem

We currently create 8 sockets/threads to pull out turbine shreds. Using 8 of these sockets is excessive, and actually leads to more inefficiencies (ie less densely packed PacketBatch) as some data below will show. We have a 1ms coalesce duration as part of the receiver, but given that we have 8 receivers all running independently, this 1ms coalesce doesn't mean much since they're all competing for packets. Part of work for #35

Summary of Changes

Reduce the default value of sockets/threads from 8 to 1.

One potential concern I could see for reducing the number of receivers would be if network load increased and over-burdened the single streamer. However, bench-streamer would indicate that the streamer can handle MUCH higher loads than what we're seeing

$ cargo run --release -- --num-recv-sockets 1
    Finished release [optimized] target(s) in 0.27s
     Running `/home/sol/src/agave/target/release/solana-bench-streamer --num-recv-sockets 1`
performance: 879154.6507920896 (PPS)

This is a micro-benchmark so its' results need to be noted with caution, but again, this is several orders of magnitude greater than the current mnb load of 2500-3000 inbound shreds per second.

Testing / Data

I've been running two nodes in an A/B setup; one node with the change and one without. The nodes were restarted with this change around 18:00 UTC on 2024/04/22. We'll examine shred_sigverify stats, as this is where the receivers all send their packets and where some of the gains can be observed.

For starters, we see similar values for num_packets, ignoring the restart spike around 18:00. This is not surprising / a basic sanity check:
image

However, num_batches is significantly smaller for the node running this branch. To simplify, this graphs shows num_packets / num_batches to yield packets-per-batch:
image

  • The tip of master node is seeing something like 1.16 packets per batch. So, something like one 2-packet-batch to every five 1-packet-batches.
  • The experiment node is showing ~3.5 packets per batch

More packets per batches means better locality, which we can see with a ~20% drop in total time spent via elapsed_micros:
image

And less total time with the same number of packets also means a similar ~20% drop in average spent time per packet

@steviez steviez changed the title Reduce from 8 to 1 TVU receive socket/thread Reduce default from 8 to 1 TVU receive socket/threads Apr 23, 2024
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.9%. Comparing base (57ab944) to head (cf6c65f).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #998   +/-   ##
=======================================
  Coverage    81.9%    81.9%           
=======================================
  Files         855      855           
  Lines      232154   232154           
=======================================
+ Hits       190170   190172    +2     
+ Misses      41984    41982    -2     

@behzadnouri
Copy link

We probably should rename these (or add some comments) because it is pretty confusing which of the many tvu sockets this is referring to. But it looks like this is the shred-fetch-stage sockets (i.e. multiple sockets all bound to contact_info.tvu() socket). is that correct?

It seems like the motivation to use multiple sockets (for the same port) is explained here: solana-labs#1228 and in particular to address this issue: solana-labs#1224

So performance in steady state aside, the issue is under heavy load socket queues are saturated and packets are dropped.

This commit effectively reverts #1228. Wouldn't that be a concern that socket queue saturation and issues like #1224 arise again?

@steviez
Copy link
Author

steviez commented Apr 23, 2024

We probably should rename these (or add some comments) because it is pretty confusing which of the many tvu sockets this is referring to.

I'm open to a wider rename to make things more specific; I opted to match existing nomenclature:

let (tvu_port, tvu_sockets) =
multi_bind_in_range(bind_ip_addr, port_range, num_tvu_sockets.get())
.expect("tvu multi_bind");

But it looks like this is the shred-fetch-stage sockets (i.e. multiple sockets all bound to contact_info.tvu() socket). is that correct?

Yep, that's right

@steviez
Copy link
Author

steviez commented Apr 23, 2024

... This commit effectively reverts solana-labs#1228. Wouldn't that be a concern that socket queue saturation and issues like solana-labs#1224 arise again?

Yeah, totally agree that we need to provision things to be able to withstand worst-case / heavy load scenarios. A few points:

  • I have not observed an increase in rcvbuf errors on my test node
    • I know this is steady state, was going to test in a private cluster where we can artificially increase shreds/block as well
  • Not that we want any drops in network buffers, but redundancy of the system has increased dramatically since that PR. Ie, the data/coding ratio from then is seen here
  • I'm a little hesitant to put too much weight into the graph from Occasional UDP packet drop on validator nodes solana-labs/solana#1224 as it relates to today's validator.
    • A Blob was nearly 65k bytes
    • Blob receive was not using recvmmsg, so the solana code was unable to pull packets out as fast as it can now
    • The graph shows ~1k packets-per-second, current MNB has ~27k. But, hard to compare these against each other given previously mentioned packet size differences
  • Lastly, if I'm not super familiar with the linux socket interface, but if socket queue capacity was the issue, I would expect us to be able to tweak that parameter instead of creating 8 sockets (that I believe must all be read) to spread the load

@behzadnouri
Copy link

Yeah, totally agree that we need to provision things to be able to withstand worst-case / heavy load scenarios. A few points:

so how can we ensure this change does not resurface previous issues?

Lastly, if I'm not super familiar with the linux socket interface, but if socket queue capacity was the issue, I would expect us to be able to tweak that parameter instead of creating 8 sockets (that I believe must all be read) to spread the load

sys-tuner used to do that:
https://github.com/anza-xyz/agave/blob/9b860abfc/sys-tuner/src/main.rs#L103-L107
and same instructions are part of the current validator docs:
https://docs.solanalabs.com/operations/setup-a-validator#linux

But I do not know if the mulit-socket trick was added before above instructions used or they were still needed in addition to those instructions.
Obviously it would be much better if we can achieve the same effect without overloading the runtime or spawning extra threads.

@bw-solana
Copy link

FYI - We have an invalidator setup where we can hit the cluster with ~1M TPS in aggregate of basic transfers. This manifests as ~20k TPS at the banking stage/replay level and 3k shreds per slot. In other words, 4-5x the typical mainnet traffic. Is that good enough to evaluate this change?

image

@steviez
Copy link
Author

steviez commented Apr 24, 2024

sys-tuner used to do that: https://github.com/anza-xyz/agave/blob/9b860abfc/sys-tuner/src/main.rs#L103-L107 and same instructions are part of the current validator docs: https://docs.solanalabs.com/operations/setup-a-validator#linux

Good callout, some further reading has me convinced this is the right knob:

rmem_default
The default setting of the socket receive buffer in bytes.

And FWIW, tuning those settings manually is effectively "required":

const INTERESTING_LIMITS: &[(&str, InterestingLimit)] = &[
("net.core.rmem_max", InterestingLimit::Recommend(134217728)),
(
"net.core.rmem_default",
InterestingLimit::Recommend(134217728),
),

Someone can opt out with --no-os-network-limits-test, but this is a scenario where we will not offer support:

agave/validator/src/main.rs

Lines 1749 to 1750 in 550f806

if !matches.is_present("no_os_network_limits_test") {
if SystemMonitorService::check_os_network_limits() {

But I do not know if the mulit-socket trick was added before above instructions used or they were still needed in addition to those instructions.

Let me see if I can do some digging to figure this out. However, I think my earlier comment is very important to note as well:

Blob receive was not using recvmmsg, so the solana code was unable to pull packets out as fast as it can now

Drops mean that packets were coming in faster than the validator could consume them. Even with a very large buffer, we theoretically would expect some drops on a long enough time scale. More threads were added to allow a higher overall consumption rate. But, using recvmmsg() would have optimized the single-threaded case, and was even listed as a possible solution on solana-labs#1224. At some point after multi-socket went in, the change was made to make this path use recvmmsg() anyways.

so how can we ensure this change does not resurface previous issues?

Supposing we run a test like the one Brennan mentioned where we have blocks that contain N more shreds than current MNB load, do you view that as a sufficient worst-case test for a large enough N?

@behzadnouri
Copy link

yeah, I think we just need to stress test this with some massive number of shreds at tvu port.
if we can also figure out how far it can be pushed until we observe some rcvbuf error and packet drop that would be great.

@steviez
Copy link
Author

steviez commented Apr 24, 2024

yeah, I think we just need to stress test this with some massive number of shreds at tvu port. if we can also figure out how far it can be pushed until we observe some rcvbuf error and packet drop that would be great.

Cool, think we're in agreement; will report back with some data once I get the chance to setup and run this experiment

A single streamer is capable of handling the load, and doing so with a
single thread is more efficient.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

4 participants